Abstract

VGGNets have turned out to be effective for object
recognition in still images. However, directly adapting
VGGNet models trained on the ImageNet dataset to scene
recognition does not yield good performance.
This report describes our implementation of training
VGGNets on the large-scale Places205 dataset. Specifically,
we train three VGGNet models, namely VGGNet-11,
VGGNet-13, and VGGNet-16, using a Multi-GPU extension
of the Caffe toolbox with high computational efficiency. We
verify the performance of the trained Places205-VGGNet
models on three datasets: MIT67, SUN397, and Places205. Our
trained models achieve state-of-the-art performance on
these datasets and are made publicly available (see footnote 1).

1. Introduction 

Convolutional networks (ConvNets) [5] have achieved
great success in image classification [4]. In the recent
ILSVRC competition [7], several successful architectures
were proposed for object recognition, such as GoogLeNet
[9] and VGGNet [8]. However, directly adapting these
models trained on the ImageNet dataset [2] to the task
of scene recognition does not yield good performance.
Besides, training complicated VGGNets on a large-scale scene
dataset is non-trivial: it requires large computational
resources and considerable training expertise. In this report,
we train high-performance VGGNet models for scene
recognition on the Places205 dataset [13]. The contribution of
this report is twofold:

• Our trained Places205-VGGNet models achieve
state-of-the-art performance on the Places205 dataset
[13]. As training VGGNets is very time consuming,
we release our models to advance further
research on scene recognition.

• We transfer the trained models to other scene datasets,
including MIT67 [6] and SUN397 [12], and extract
ConvNet features off-the-shelf. Our trained
Places205-VGGNet models achieve the best performance
on these two datasets.

2. Implementation Details 

The VGGNets were originally developed for object
recognition and detection [8]. They have very deep
convolutional architectures with small convolutional
kernels (3 × 3), stride (1 × 1), and pooling windows (2 × 2).
There are four different network structures, ranging from
11 layers to 19 layers. Model capacity increases as the
network goes deeper, but at a heavier computational cost.
Following the original implementation of [8],
we start by training an 11-layer VGGNet and then train
deeper VGGNets using the pre-trained 11-layer model
for initialization.
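
To make the depth naming concrete, the following Python sketch lists the convolutional layouts of the 11-, 13-, and 16-layer networks as given in configurations A, B, and D of [8] and counts their weight layers. The names `VGG_CONFIGS`, `weight_layers`, and `final_feature_map` are illustrative and are not taken from the Caffe implementation; the convolution padding of 1 and the pooling stride of 2 follow [8] rather than anything stated in this report.

```python
# Convolutional layouts of the 11-, 13-, and 16-layer VGGNets
# (configurations A, B, and D in [8]). An integer is the number of
# output channels of a 3x3 convolution; "M" marks a 2x2 max pooling.
VGG_CONFIGS = {
    11: [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    13: [64, 64, "M", 128, 128, "M", 256, 256, "M",
         512, 512, "M", 512, 512, "M"],
    16: [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
         512, 512, 512, "M", 512, 512, 512, "M"],
}

NUM_FC_LAYERS = 3  # fc6, fc7, and the final classification layer


def weight_layers(config):
    """Count layers with learnable weights: convolutions plus FC layers."""
    return sum(1 for item in config if item != "M") + NUM_FC_LAYERS


def final_feature_map(config, input_size=224):
    """Spatial size after the last pooling stage.

    Assumes (following [8]) that 3x3 convolutions with stride 1 and
    padding 1 keep the size and each 2x2 pooling with stride 2 halves it.
    """
    size = input_size
    for item in config:
        if item == "M":
            size //= 2
    return size


for depth, config in VGG_CONFIGS.items():
    print(depth, weight_layers(config), final_feature_map(config))
```

The counts show where the extra depth comes from: the deeper variants only add 3 × 3 convolutions between the same five pooling stages, which is why a shallower model can initialize part of a deeper one.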

Specifically, we implement the ConvNets using the public
Caffe toolbox [3]. As the computational cost and memory
consumption of VGGNets are much larger than those of other
architectures (e.g. GoogLeNet), we use a Multi-GPU extension
of Caffe [11], which is publicly available (see footnote 2).
In addition, this extension provides more data augmentation
techniques, such as the corner cropping strategy and the
multi-scale cropping method, which have been shown to be
effective for action recognition in videos. We therefore
also adopt these two augmentation techniques.
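
As a rough illustration of how the two augmentation techniques can be combined, the sketch below samples one crop box per training image. The candidate sizes and the 256 × 256 image size are the ones this report gives for multi-scale training; the function name `sample_crop` and the choice of exactly five anchor positions (four corners plus center) are assumptions of this sketch and are not taken from the Caffe extension's code.

```python
import random

CROP_SIZES = (256, 224, 198, 168)  # candidate crop widths/heights from this report
IMAGE_SIZE = 256                   # training images are resized to 256 x 256
NET_INPUT = 224                    # every crop is later rescaled to 224 x 224


def sample_crop(rng=random):
    """Pick a crop box (left, top, width, height) inside a 256 x 256 image.

    Multi-scale cropping: width and height are drawn independently,
    so both the scale and the aspect ratio of the crop vary.
    Corner cropping (as assumed here): the box is anchored at one of
    the four corners or at the center rather than at a random offset.
    """
    width = rng.choice(CROP_SIZES)
    height = rng.choice(CROP_SIZES)
    max_left = IMAGE_SIZE - width
    max_top = IMAGE_SIZE - height
    anchors = [
        (0, 0),                         # top-left
        (max_left, 0),                  # top-right
        (0, max_top),                   # bottom-left
        (max_left, max_top),            # bottom-right
        (max_left // 2, max_top // 2),  # center
    ]
    left, top = rng.choice(anchors)
    return left, top, width, height


rng = random.Random(0)
for _ in range(3):
    print(sample_crop(rng))
```

Each sampled box would then be cut out of the image and resized to `NET_INPUT` × `NET_INPUT` before being fed to the network.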

The training of ConvNets is performed with the mini-batch
gradient descent method, where the batch size is set to 256
and the momentum to 0.9. To reduce over-fitting,
the training is regularized by weight decay (the L2
penalty multiplier set to 0.0005) and dropout for the first
two fully connected layers (with a ratio of 0.5). During the
training phase, the images are resized to 256 × 256. For
multi-scale training, we randomly select the width and height of
cropped regions from {256, 224, 198, 168}. These cropped
regions are then resized to 224 × 224 for further processing.
We start by training the 11-layer VGGNet, where network
weights are randomly initialized from a Gaussian distribution
(mean set to 0 and standard deviation set to 0.01). The
learning rate is initially set to 0.01 and decreased to 1/10
of its value every 10k iterations. The whole training process
stops at 40k iterations. To
train the 13-layer and 16-layer VGGNets, we initialize the
first four convolutional layers and first two fully connected
layers with the pre-trained 11-layer VGGNet.

1 https://github.com/wanglimin/Places205-VGGNet
2 https://github.com/yjxiong/caffe/tree/action_recog

Method                     top-1 val/test   top-5 val/test
Places205-AlexNet [13]     50.4/50.0        80.9/81.1
Places205-GoogLeNet [1]    -/55.5           -/85.7
Places205-CNDS-8 [10]      54.7/55.7        84.1/85.8
Places205-VGGNet-11        58.6/59.0        87.6/87.6
Places205-VGGNet-13        60.2/60.1        88.1/88.5
Places205-VGGNet-16        60.6/60.3        88.5/88.8

Table 1. Performance comparison of different network
architectures on the Places205 dataset.
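
The optimization settings described for the 11-layer model can be summarized in a few lines of Python. This is only a sketch under stated assumptions: the decay factor is taken to be 1/10 per step, the update rule is the usual momentum formulation with weight decay added to the gradient, and the names `learning_rate`, `sgd_step`, and `gaussian_init` are hypothetical helpers rather than part of Caffe.

```python
import random

BASE_LR = 0.01        # initial learning rate
STEP_SIZE = 10_000    # iterations between two decays
MAX_ITER = 40_000     # training stops at this iteration
GAMMA = 0.1           # decay factor per step (assumed to be 1/10)


def learning_rate(iteration):
    """Step schedule: multiply the base rate by GAMMA every STEP_SIZE iterations."""
    return BASE_LR * GAMMA ** (iteration // STEP_SIZE)


def sgd_step(weight, grad, velocity, lr, momentum=0.9, weight_decay=0.0005):
    """One momentum-SGD update of a single scalar weight.

    The L2 penalty contributes weight_decay * weight to the gradient.
    Returns the updated weight and the updated velocity.
    """
    velocity = momentum * velocity - lr * (grad + weight_decay * weight)
    return weight + velocity, velocity


def gaussian_init(num_weights, std=0.01, seed=0):
    """Draw initial weights from a zero-mean Gaussian with the given std."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(num_weights)]


for iteration in (0, 10_000, 20_000, 30_000):
    print(iteration, learning_rate(iteration))
```

Under these assumptions the schedule passes through four learning-rate levels before training stops at `MAX_ITER`.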

For testing the VGGNet models, we follow the multi-view
classification method [4]. Specifically, we crop
regions of 224 × 224 from the four corners and the center of
the image, whose size is 256 × 256. These cropped
regions are also horizontally flipped. We thus obtain 10
views, each of which is fed into the ConvNet models for
prediction. The final prediction score is the average of
the 10 predictions.
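
The 10-view procedure can be sketched in plain Python as follows. Images are represented as nested lists of rows so that the example stays self-contained; the helper names `ten_view_boxes`, `crop_view`, and `average_scores` are hypothetical, and the per-view class scores are assumed to come from whatever model is being evaluated.

```python
def ten_view_boxes(image_size=256, crop_size=224):
    """Return (left, top, flip) for four corners and the center,
    each once unflipped and once horizontally flipped."""
    far = image_size - crop_size   # offset of the right/bottom crops
    mid = far // 2                 # offset of the center crop
    offsets = [(0, 0), (far, 0), (0, far), (far, far), (mid, mid)]
    return [(left, top, flip)
            for flip in (False, True)
            for left, top in offsets]


def crop_view(image, left, top, size, flip):
    """Cut a size x size window out of a row-major image and optionally mirror it."""
    rows = [row[left:left + size] for row in image[top:top + size]]
    return [row[::-1] for row in rows] if flip else rows


def average_scores(per_view_scores):
    """Average class scores over views; one list of class scores per view."""
    num_views = len(per_view_scores)
    return [sum(column) / num_views for column in zip(*per_view_scores)]


# Toy usage on an 8 x 8 "image" with 6 x 6 crops.
toy = [[row * 8 + col for col in range(8)] for row in range(8)]
views = [crop_view(toy, left, top, 6, flip)
         for left, top, flip in ten_view_boxes(image_size=8, crop_size=6)]
print(len(views), len(views[0]), len(views[0][0]))
```

Averaging the scores of all views, rather than taking a vote, keeps the relative confidence of each view in the final prediction.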


3. Experiments 

In this section, we describe our experimental details and 
results. The training of VGGNets on the Places205 dataset 
is implemented with a Multi-GPU extension of Caffe [11]. 
In our experiments, we use 4 GTX Titan-X GPUs, and the
total training time of VGGNet-16 is around 2 weeks.
To test the performance of our trained Places205-VGGNet 
models, we conduct experiments on three datasets, namely 
Places205, MIT67, and SUN397. 

First, we perform evaluation on the Places205 dataset [13];
the results are summarized in Table 1. We compare
with other deep network architectures, namely AlexNet [4],
GoogLeNet [9], and CNDS-8 [10], and observe that VGGNets
obtain much better performance on this dataset:
VGGNet-16 reaches 60.3 top-1 and 88.8 top-5 on the test
set, against 55.7 and 85.8 for CNDS-8.

To further verify the effectiveness of the Places205-VGGNet
models on scene recognition, we transfer the learned
representations to the MIT67 [6] and SUN397 [12] datasets.
Specifically, we extract fc6 features and normalize them
with the ℓ2-norm. Then we employ linear SVMs as classifiers
for scene category prediction. The experimental results are
shown in Table 2. We compare our Places205-VGGNet
models with other public models, and our models achieve the
best performance on these two challenging datasets.
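
The normalization step applied to the fc6 features before SVM training can be written as a small helper. This is a minimal sketch: `l2_normalize` and its `eps` guard against zero vectors are assumptions of the example, and the short vectors below are toy stand-ins for real fc6 activations.

```python
import math


def l2_normalize(feature, eps=1e-12):
    """Scale a feature vector to unit Euclidean length.

    eps avoids division by zero for an all-zero vector (an assumption
    of this sketch, not something specified in the report).
    """
    norm = math.sqrt(sum(x * x for x in feature))
    return [x / max(norm, eps) for x in feature]


# Toy stand-ins for fc6 feature vectors.
features = [[3.0, 4.0], [1.0, 0.0], [0.0, 0.0]]
normalized = [l2_normalize(f) for f in features]
print(normalized)
```

The normalized vectors would then be passed, together with the scene labels, to a linear SVM implementation of choice.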


Model                      MIT67   SUN397
ImageNet-VGGNet-16 [8]     67.7    51.7
Places205-AlexNet [13]     68.2    54.3
Places205-CNDS-8 [10]      76.1    60.7
Places205-GoogLeNet [1]    76.3    61.1
Places205-VGGNet-11        82.0    65.3
Places205-VGGNet-13        81.9    66.7
Places205-VGGNet-16        81.2    66.9


Table 2. Performance comparison of transferred representations 
from different models on the MIT67 and SUN397 datasets. 


4. Conclusions 

In this report, we describe our implementation of training
VGGNets on the large-scale Places205 dataset with
a Multi-GPU extension of Caffe. The trained Places205-VGGNet
models achieve state-of-the-art performance
on three scene recognition benchmarks, namely Places205,
SUN397, and MIT67. We release our trained Places205-VGGNet
models for further research in scene recognition.